Food insecurity is one of the most significant environmental justice challenges in the United States, affecting more than 42 million people. Approximately 10.5% of US households experience some form of food insecurity. The COVID-19 pandemic has exacerbated hunger in America, hitting hardest the families that were already facing hunger. Before the pandemic, more than 12 million children lived in food-insecure households; that number has since risen to 13 million. BIPOC communities face the highest rates of starvation and hunger in the nation. Within the Bay Area, 11.5% of people identify as food insecure, and only 38% of them qualify for food stamps. There are many ways to quantify food insecurity, but we will focus on easy access to supermarkets.
The USDA has developed a food access database that presents measures of supermarket accessibility by census tract. We aim to compare Alameda County, one of the areas facing the greatest food insecurity in the Bay Area, with San Francisco County. Both are urban, densely populated areas, but they have drastically different food health and food access issues.
Is there a statistical correlation between race, SNAP eligibility, and food access? What is the relationship between race and SNAP eligibility? What is the relationship between food access and income? What is the relationship between cardiovascular health and income level? Through these questions, we will draw conclusions about the relationships among race, SNAP, health metrics, and income. We chose SNAP because it captures the intersection of food and income in a single variable. We also acknowledge that this is not an exclusively urban problem (there is much evidence of food insecurity in rural areas); however, the urban setting exacerbates many of the issues detailed above.
##SNAP Eligibility per County in the Bay Area (grouped by county eligibility)
We found that Alameda, Santa Clara, Contra Costa, and San Francisco counties have the highest numbers of qualifying households in the Bay Area. For our equity analysis, we narrow our focus to Alameda County and San Francisco County: they share a similar urban density but differ in their food health and food access issues, which may make them the most interesting pair to compare.
##Equity Analysis of SNAP Eligibility by Race
Compared with the totals, the proportion of white people qualifying for SNAP decreased in both counties, while the proportion of Black or African American people increased in both counties. In San Francisco, the proportion of Asian people qualifying for SNAP increased slightly, whereas in Alameda County it decreased significantly. The proportions for Some Other Race alone, Native Hawaiian, American Indian and Alaska Native alone, and Two or More Races increased in both counties. This is not surprising, and the breakdown follows national trends (the proportion of white people being greatest, then Black/African American, then Hispanic and Asian). Based on these findings, we will use Black or African American as our focus racial group from now on (health effects only). Although our results would likely differ if we included ethnicity, for the purpose of this analysis we concentrate only on race.
##Correlation between building type, SNAP allocation, income and tenure (by PUMAs)
Let’s return to ACS data and compare four different variables in the Bay Area, using household-level records grouped by PUMA: building type, SNAP allocation (whether the household was allocated SNAP), tenure (owned or rented), and income. Our hypothesis is that some of these variables are correlated. For example, home ownership below a certain income bracket is very uncommon, so we would expect income to correlate with home ownership. Common sense (which we have learnt not to trust) tells us that these outcomes are likely to be related: if one is high for a given area, we imagine the other would also be high.
To test this, we created a new binary variable named allocated, which is 1 when household income is below 66k/yr and the household was allocated SNAP benefits. This allows us to control for income, which is essential given that income is the main criterion for SNAP eligibility and we are interested in additional explanatory power beyond income.
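The construction of the binary outcome can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `households` and the column names `income` and `snap` are hypothetical stand-ins, not the actual PUMS variable names used in the project.

```r
# Hypothetical household records; `income` and `snap` are illustrative
# column names, not the real PUMS variables.
households <- data.frame(
  income = c(25000, 48000, 90000, 60000, 30000),
  snap   = c(1, 0, 1, 1, 0)  # 1 = allocated SNAP, 0 = not allocated
)

income_cutoff <- 66000  # the 66k/yr threshold described in the text

# allocated is 1 only when income is below the cutoff AND SNAP was allocated
households$allocated <- as.integer(
  households$income < income_cutoff & households$snap == 1
)

print(households)
```

Folding the income threshold into the outcome means a household above the cutoff is always coded 0, even if it reports SNAP benefits.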
The results of our logit model are below.
##
## Call:
## glm(formula = allocated ~ building + tenure + kitchen + puma,
## family = quasibinomial(), data = bay_pums_factored)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.6194  -0.2748  -0.2189  -0.1757   3.4737
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -2.16970    0.58825  -3.688 0.000228 ***
## building2     -1.02120    0.47519  -2.149 0.031678 *
## building3     -1.23964    0.56015  -2.213 0.026936 *
## building4     -0.64073    0.55375  -1.157 0.247293
## building5     -1.20333    0.55903  -2.153 0.031401 *
## building6     -1.06045    0.56704  -1.870 0.061520 .
## building7     -1.44002    0.61826  -2.329 0.019889 *
## building8     -1.43931    0.59055  -2.437 0.014833 *
## building9     -1.35707    0.55071  -2.464 0.013763 *
## building10   -15.49089 2904.25614  -0.005 0.995744
## tenure2       -0.89760    0.29275  -3.066 0.002179 **
## tenure3       -0.42106    0.24000  -1.754 0.079414 .
## tenure4        0.32974    0.43785   0.753 0.451431
## kitchen2      -1.59955    1.03580  -1.544 0.122584
## puma00102     -0.01476    0.42568  -0.035 0.972349
## puma00103     -0.07534    0.56326  -0.134 0.893593
## puma00104      0.35436    0.40276   0.880 0.378989
## puma00105     -0.30853    0.48853  -0.632 0.527711
## puma00106     -2.08315    1.06081  -1.964 0.049612 *
## puma00107      0.40156    0.42194   0.952 0.341290
## puma00108      0.03648    0.51598   0.071 0.943641
## puma00109      0.36878    0.45594   0.809 0.418648
## puma00110      0.10049    0.48883   0.206 0.837137
## puma07501      1.25659    0.40314   3.117 0.001837 **
## puma07502      0.93628    0.45753   2.046 0.040767 *
## puma07503      0.36615    0.51376   0.713 0.476062
## puma07504    -14.88140  467.16184  -0.032 0.974589
## puma07505     -0.46167    0.68132  -0.678 0.498054
## puma07506     -1.74479    1.06334  -1.641 0.100886
## puma07507      0.01798    0.49362   0.036 0.970946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.012113)
##
## Null deviance: 1377.3 on 5387 degrees of freedom
## Residual deviance: 1294.0 on 5358 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 17
Data Dictionary:
Here are the meanings of the outcomes for each of the factors of our model.
Tenure
1. Owned with mortgage or loan (include home equity loans)
2. Owned free and clear
3. Rented
4. Occupied without payment of rent
Kitchen (complete kitchen facilities)
1. Yes, has stove or range, refrigerator, and sink with a faucet
2. No
Building (units in structure)
01. Mobile home or trailer
02. One-family house detached
03. One-family house attached
04. 2 Apartments
05. 3-4 Apartments
06. 5-9 Apartments
07. 10-19 Apartments
08. 20-49 Apartments
09. 50 or more apartments
10. Boat, RV, van, etc.
Results from logit model:
Building Type: Coefficients are relative to the baseline, building type 1 (mobile home or trailer). Building types 2 and 3 (one-family houses, detached and attached) have statistically significant negative estimates for SNAP allocation + income, as do types 5, 7, 8, and 9. In fact, every building type has a negative association with SNAP allocation relative to the baseline, with estimates ranging from -15.49 to -0.64. The estimate for type 10 (boat, RV, van, etc.) has a standard error of about 2904, so it carries no real information.
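Tenure: Coefficients are relative to the baseline, tenure 1 (owned with mortgage or loan). Renters (tenure 3, estimate -0.42) are more likely to be allocated SNAP than households that own free and clear (tenure 2, estimate -0.90), which is expected and partly supports our hypothesis. However, the renter estimate is itself negative relative to mortgaged owners and is significant only at the 10% level (p = 0.079), so the difference between renters and owners is not drastic. Across tenure levels the estimates range from -0.90 to 0.33.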
Kitchen: Outcome 1 signifies “has a kitchen” and serves as the baseline; Outcome 2 means “has no kitchen.” The estimate for Outcome 2 is -1.60, but its standard error is large (1.04) and the result is not statistically significant (p = 0.12), possibly because of high levels of collinearity between two or more variables. Kitchen access is incredibly important, and common sense suggests it should matter, but sadly this model does not show a statistically significant effect.
PUMA: The two most statistically significant PUMAs are PUMA 07501 and PUMA 07502, both of which have a positive correlation with SNAP allocation. PUMA 07501 has an estimate of 1.257 (p = 0.002) and PUMA 07502 has an estimate of 0.936 (p = 0.041), each relative to the baseline PUMA 00101. PUMA 00106 is the only other significant PUMA, with a negative estimate of -2.083 (p = 0.0496). Check the map below to see what geographical areas these PUMAs correspond to.
Most of the coefficients are modest, and many cannot be distinguished from 0. That is to say, once income is built into the SNAP allocation outcome, the remaining association with our other variables (building, kitchen, tenure, and PUMA) is small, though still significant for some levels as per the asterisks. The residual deviance (1294.0) is only about 6% below the null deviance (1377.3). There may be a big causal mechanism here, but we cannot make causal claims from the shared variation in these observations.
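Because the estimates in the model output are on the log-odds scale, it can help to convert a few of them to odds ratios. The sketch below copies selected coefficients and the two deviance values from the printed output and assumes the default logit link of `quasibinomial()`; the variable names are illustrative only.

```r
# Coefficients copied from the printed model output (log-odds scale).
log_odds <- c(
  building2 = -1.02120,
  tenure2   = -0.89760,
  tenure3   = -0.42106,
  kitchen2  = -1.59955,
  puma07501 =  1.25659
)

# exp() turns a log-odds coefficient into an odds ratio relative to the
# omitted baseline level of the same factor.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))

# Share of the null deviance removed by the predictors.
null_deviance      <- 1377.3
residual_deviance  <- 1294.0
deviance_reduction <- (null_deviance - residual_deviance) / null_deviance
print(round(deviance_reduction, 3))
```

An odds ratio below 1 means lower odds of being in the allocated group than the baseline level, and an odds ratio above 1 means higher odds.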
##CalEnviroScreen: correlating Cardiovascular Health and Poverty in Alameda and San Francisco Counties
This map shows a notable difference in cardiovascular health between San Francisco and Alameda County, especially in the San Leandro and Hayward area, which has a score of 21.04.
In comparison to the previous map, this map shows a much more even distribution of households in poverty across the two counties. Both areas contain similarly low and similarly high poverty levels.
The scatter plot does not show a clear relationship: there are several outliers, and the points themselves appear almost random.
##
## Call:
## lm(formula = `Cardiovascular Disease` ~ Poverty, data = bay_cardio_poverty_tract)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9114 -2.7953 -0.4989  2.0243 10.7705
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  9.43532    0.27951   33.76  < 2e-16 ***
## Poverty      0.05148    0.01046    4.92 1.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.692 on 549 degrees of freedom
## Multiple R-squared: 0.04223, Adjusted R-squared: 0.04048
## F-statistic: 24.2 on 1 and 549 DF, p-value: 1.147e-06
As you can see, a one-unit increase in Poverty is associated with an increase of 0.0515 in Cardiovascular Disease, and the intercept of 9.435 is the predicted Cardiovascular Disease value when Poverty is 0. Only 4.2% of the variation in Cardiovascular Disease is explained by the variation in Poverty. The p-value of 1.147e-06 is below 5%, making these results statistically significant.
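Reading the coefficient table directly, the fitted line is Cardiovascular Disease = 9.43532 + 0.05148 × Poverty. The short sketch below evaluates that line at a few Poverty values; the helper `predict_cvd` and the chosen Poverty values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the lm() output.
intercept <- 9.43532
slope     <- 0.05148

# Hypothetical helper: predicted Cardiovascular Disease for a Poverty value.
predict_cvd <- function(poverty) intercept + slope * poverty

# Illustrative Poverty values (not taken from the data).
poverty_values <- c(0, 10, 50)
print(predict_cvd(poverty_values))

# Change in the prediction for a 10-point increase in Poverty.
print(predict_cvd(30) - predict_cvd(20))
```

The second printed value equals ten times the slope, which is what "per one-unit increase" means for a linear model.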
The graph above shows the distribution of residuals from our model. While the peak is fairly close to 0, the distribution is skewed to the left and not evenly distributed on both sides. Thus, we will fit a logarithmic version of our model to try to normalize the distribution.
The scatter plot still does not show a clear relationship, as there are several outliers: most points do not follow our line of best fit and fall outside the margin of error. Points are concentrated on the left side of the graph but span the full range of the y-axis.
##
## Call:
## lm(formula = log(`Cardiovascular Disease`) ~ Poverty, data = bay_cardio_poverty_tract)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.11942 -0.24454  0.01079  0.22975  0.78557
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1918798  0.0263256  83.261  < 2e-16 ***
## Poverty     0.0047181  0.0009856   4.787 2.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3477 on 549 degrees of freedom
## Multiple R-squared: 0.04007, Adjusted R-squared: 0.03832
## F-statistic: 22.92 on 1 and 549 DF, p-value: 2.18e-06
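In this case, a one-unit increase in Poverty is associated with an increase of 0.0047 in log(Cardiovascular Disease), or roughly a 0.47% increase in Cardiovascular Disease; the intercept of 2.19 is the predicted log(Cardiovascular Disease) when Poverty is 0. Only 4.0% of the variation in log(Cardiovascular Disease) is explained by the variation in Poverty. The p-value of 2.18e-06 is below 5%, making these results statistically significant.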
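For a model with a logged outcome, the slope is easiest to read after exponentiating it. The sketch below uses the two coefficients from the log model output; the variable names are illustrative.

```r
# Coefficients copied from the log-model output.
intercept_log <- 2.1918798
slope_log     <- 0.0047181

# Percent change in Cardiovascular Disease per one-unit increase in Poverty.
pct_change_per_unit <- (exp(slope_log) - 1) * 100
print(round(pct_change_per_unit, 2))

# Predicted Cardiovascular Disease on the original scale when Poverty is 0.
baseline_cvd <- exp(intercept_log)
print(round(baseline_cvd, 2))
```

For slopes this small, the percent change is almost identical to 100 times the coefficient itself.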
While the residuals of the log model are marginally better, they are still not fully normalized, so we draw conclusions with caution. This broadly validates our previous interpretation of the scatter plot and reinforces how weak the correlation between Cardiovascular Disease and Poverty is.
A small residual indicates that the modeled regression closely matches the data collected, meaning the model could be used to make reasonable estimates. A positive residual represents an underestimation, while a negative residual represents an overestimation.
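The sign convention for residuals can be checked with a small base R sketch. The numbers below are made-up illustrative values, not project data; only the relationship residual = observed - fitted is the point.

```r
# Illustrative values only; these are NOT taken from the project data.
poverty <- c(5, 10, 20, 35, 50)
cvd     <- c(9.0, 11.5, 9.8, 12.9, 11.2)

fit <- lm(cvd ~ poverty)

# Residual = observed - fitted, the same quantity residuals(fit) returns.
res <- cvd - fitted(fit)

# Positive residual: the model predicted too low (underestimate).
# Negative residual: the model predicted too high (overestimate).
summary_table <- data.frame(
  poverty   = poverty,
  observed  = cvd,
  fitted    = fitted(fit),
  residual  = res,
  direction = ifelse(res > 0, "underestimate", "overestimate")
)
print(summary_table)
```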
Residuals range from roughly -10 to 10. Given that the model explains only about 4% of the variation, this map is best read as showing where the model over- or underestimates rather than as evidence of high accuracy. The negative residuals (lighter colors) are concentrated around the San Francisco area, while the positive residuals (darker colors) are in Alameda County. This shows a higher concentration of overestimation in San Francisco, while Alameda has a higher concentration of underestimation.
In human terms, one possible explanation of this trend is the systems of bias in place in these two counties. Perhaps communities of concern are more affected by health issues yet have fewer resources to address them. As a consequence, the model overestimates cardiovascular disease from poverty in San Francisco, a highly developed, high-income county, while it often underestimates it in Alameda, a less developed and lower-income county. Another possibility is that people with less access to the healthcare system generate less data, so the most affected people may be missing from the data.
Lastly, given the inconclusive scatter plot, this map is not very meaningful for drawing any conclusion.
##Equity Analysis of Individuals within a 1-mile range of a grocery store (food desert)
The USDA defines food deserts as low-income areas in which more than a third of the population at the census tract level lives over a mile from a grocery store or supermarket (10 miles for rural areas). Below is a bar chart detailing the number of individuals whose households are beyond a 1-mile radius of a grocery store, by race, within our two counties of interest.
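The definition above can be written as a simple rule. This is a minimal sketch under stated assumptions: the data frame `tracts`, its column names, and its values are hypothetical placeholders and do not reflect the actual field names or contents of the USDA database.

```r
# Hypothetical tracts; column names and values are illustrative only.
tracts <- data.frame(
  tract             = c("A", "B", "C", "D"),
  low_income        = c(TRUE, TRUE, FALSE, TRUE),
  rural             = c(FALSE, FALSE, FALSE, TRUE),
  share_beyond_1mi  = c(0.40, 0.20, 0.50, 0.90),
  share_beyond_10mi = c(0.00, 0.00, 0.00, 0.10)
)

# Use the 10-mile share for rural tracts and the 1-mile share otherwise.
share_far <- ifelse(tracts$rural,
                    tracts$share_beyond_10mi,
                    tracts$share_beyond_1mi)

# Food desert: low income AND more than a third of residents far from a store.
tracts$food_desert <- tracts$low_income & share_far > 1 / 3

print(tracts[, c("tract", "food_desert")])
```

Both conditions must hold, so a higher-income tract far from stores (tract C here) is not flagged.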
This map shows the population that is identified as low income and has low access to food options (supermarkets, groceries, and convenience stores). It stands in for a map we think would be more interesting, one that plotted the individual stores themselves. Unfortunately, as you can see from the blank tracts on the map, most of the data was NULL, meaning it had no data attached to it. This makes the map difficult to analyze, as most of San Francisco and inner Oakland have no data.
Clearly, Alameda has a drastically larger overall population, which limits what this graph can tell us. Thus, we have decided to plot a second graph showing percentages instead.
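Converting counts to within-county percentages removes the effect of the difference in total population. The sketch below shows one way to do this in base R; the counts and the group labels are placeholders, not results from this project.

```r
# Placeholder counts of people beyond 1 mile of a grocery store.
# These numbers and group labels are illustrative, NOT project results.
counts <- matrix(
  c(1200, 300, 500,
     400, 150, 250),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    county = c("Alameda", "San Francisco"),
    race   = c("Group A", "Group B", "Group C")
  )
)

# margin = 1 divides each row by its own total, so each county sums to 100%.
shares <- prop.table(counts, margin = 1) * 100
print(round(shares, 1))
```

Each row now sums to 100, so the two counties can be compared on the same scale.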
This graph, however, is very informative: it shows that within both San Francisco and Alameda County the racial group facing the most food access issues is white people. This isn’t surprising given general population demographics (similar to our other equity analysis above). The second most affected group is the Black or African American community in San Francisco and the Asian community in Alameda. For the third most affected group, those communities are flipped between the two counties, and the only other significant race category is Two or More Races.
Next, we plotted the same graph with a half-mile radius in order to compare the change in race distributions. It is important to note that the population estimates for the half-mile radius include those for the 1-mile radius, since anyone living beyond 1 mile from a store also lives beyond a half mile.
By comparing these two ranges, we found a significant change in the percentage of the Asian population in San Francisco. It is surprising how big a difference the change to a half-mile radius makes for this particular group. The percentage of individuals whose households are beyond a 0.5-mile radius of a grocery store is significantly higher than the percentage beyond a 1-mile radius. We can only speculate that this is because there are a couple of concentrations of Asian communities within the city of San Francisco. We do not have any evidence supporting this explanation; it is simply a potential interpretation that came to mind.
Additionally, we tried to do the same for a 10 mile radius but there was no data available in this dataset.
##Reflection
In further work, perhaps next quarter, it would be very interesting to plot the individual grocery stores on a map and layer our equity analysis on top of that to see the relationship between race and food access more clearly. Our hypothesis that there was a relationship between food access and race was mostly supported by our analyses, so it would be great to further this exploration with more tools next quarter.
This project gave us the opportunity to delve deeper into a serious issue within the Bay Area using our fall quarter tool kit. Though much of our analysis was pretty surface level, we were still able to create meaningful results with statistical significance.